ViperGPT: Visual Inference via Python Execution for Reasoning
Answering visual queries is a complex task that requires both visual
processing and reasoning. End-to-end models, the dominant approach for this
task, do not explicitly differentiate between the two, limiting
interpretability and generalization. Learning modular programs presents a
promising alternative, but has proven challenging due to the difficulty of
learning both the programs and modules simultaneously. We introduce ViperGPT, a
framework that leverages code-generation models to compose vision-and-language
models into subroutines to produce a result for any query. ViperGPT utilizes a
provided API to access the available modules, and composes them by generating
Python code that is later executed. This simple approach requires no further
training, and achieves state-of-the-art results across various complex visual
tasks.
Website: https://viper.cs.columbia.edu
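The pipeline the abstract describes, an LLM writing Python against a small vision API, then executing it, can be sketched roughly as follows. The module name `find` and the `execute_command` convention are illustrative assumptions, not a reproduction of the paper's actual API:

```python
# Minimal sketch of a ViperGPT-style pipeline (all names are illustrative).
# A code-generation model turns a natural-language query into a Python
# program over a documented vision API; executing that program yields the answer.

def generate_code(query: str, api_doc: str) -> str:
    """Stand-in for the code-generation model: returns a Python program
    (as text) that answers `query` using the documented API."""
    # A real system would prompt an LLM with `query` and `api_doc`;
    # here we return a canned program for demonstration.
    return (
        "def execute_command(image):\n"
        "    patches = find(image, 'muffin')\n"
        "    return str(len(patches))\n"
    )

def find(image, category: str):
    """Stand-in vision module: a real one would run an open-vocabulary detector."""
    return image.get(category, [])

def run(query: str, image) -> str:
    code = generate_code(query, api_doc="find(image, category) -> patches")
    scope = {"find": find}
    exec(code, scope)                       # defines execute_command in `scope`
    return scope["execute_command"](image)  # run the generated program

# Toy "image": a dict mapping categories to detected patches.
image = {"muffin": ["patch1", "patch2", "patch3"]}
print(run("How many muffins are there?", image))  # -> 3
```

Because the composition lives in executable code rather than network weights, each intermediate step (here, the `patches` list) is inspectable, which is the interpretability benefit the abstract claims.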
Doubly Right Object Recognition: A Why Prompt for Visual Rationales
Many visual recognition models are evaluated only on their classification
accuracy, a metric for which they obtain strong performance. In this paper, we
investigate whether computer vision models can also provide correct rationales
for their predictions. We propose a ``doubly right'' object recognition
benchmark, where the metric requires the model to simultaneously produce both
the right labels as well as the right rationales. We find that state-of-the-art
visual models, such as CLIP, often provide incorrect rationales for their
categorical predictions. However, by transferring the rationales from language
models into visual representations through a tailored dataset, we show that we
can learn a ``why prompt,'' which adapts large visual representations to
produce correct rationales. Visualizations and empirical experiments show that
our prompts significantly improve performance on doubly right object
recognition, in addition to zero-shot transfer to unseen tasks and datasets.
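The "doubly right" metric described above is simple to state in code: a sample counts as correct only when the predicted label and the predicted rationale both match the ground truth. The dictionary field names below are illustrative assumptions, not the paper's data format:

```python
# Sketch of a "doubly right" accuracy metric: credit is given only when
# both the label and the rationale are correct (field names are illustrative).

def doubly_right_accuracy(predictions, ground_truth):
    correct = sum(
        1
        for pred, gt in zip(predictions, ground_truth)
        if pred["label"] == gt["label"] and pred["rationale"] == gt["rationale"]
    )
    return correct / len(ground_truth)

preds = [
    {"label": "zebra", "rationale": "black and white stripes"},
    {"label": "zebra", "rationale": "long trunk"},  # right label, wrong rationale
]
truth = [
    {"label": "zebra", "rationale": "black and white stripes"},
    {"label": "zebra", "rationale": "black and white stripes"},
]
print(doubly_right_accuracy(preds, truth))  # -> 0.5
```

Under this metric, a model that classifies perfectly but justifies its answers incorrectly, as the abstract reports for CLIP, scores poorly even though its plain classification accuracy is high.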
NTIRE 2018 Challenge on Single Image Super-Resolution: Methods and Results
This paper reviews the 2nd NTIRE challenge on single image super-resolution (restoration of rich details in a low resolution image) with a focus on the proposed solutions and results. The challenge had 4 tracks. Track 1 employed the standard bicubic downscaling setup, while Tracks 2, 3 and 4 had realistic unknown downgrading operators simulating the camera image acquisition pipeline. The operators were learnable through provided pairs of low and high resolution train images. The tracks had 145, 114, 101, and 113 registered participants, respectively, and 31 teams competed in the final testing phase. Together, the proposed methods gauge the state of the art in single image super-resolution.